DOCUMENT RESUME

ED 064 361                                              TM 001 529

AUTHOR       Oosterhof, Albert C.; Glasnapp, Douglas R.
TITLE        Comparative Reliabilities of the Multiple Choice and
             True-False Formats.
PUB DATE     Apr 72
NOTE         5p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (Chicago,
             Illinois, April 1972)

EDRS PRICE   MF-$0.65 HC-$3.29
DESCRIPTORS  *Comparative Analysis; *Guessing (Tests); *Multiple
             Choice Tests; Ratios (Mathematics); *Student
             Evaluation; *Test Construction; Test Reliability;
             *True False Tests



ABSTRACT 

The present study was concerned with several 
currently unanswered questions, two of which are: what is an 
empirically determined ratio of multiple choice to equivalent 
true-false items which can be answered in a given amount of time?; 
and for achievement test items administered within a classroom 
situation, which of the two formats under consideration results in
greater reliability per unit of testing time? Subjects were 101
undergraduates enrolled in one section of an introductory
measurements course. Forty multiple choice items were selected on the
basis of their relationship to stated course objectives and according 
to their ability to discriminate between levels of achievement. Data 
from this research indicate that true-false items, particularly those 
items which are in fact true, result in a less reliable test than had 
a four-option multiple choice format been used. It also appears that 
when the correction for guessing formula is applied in order to 
equalize scores relative to items correctly answered on a pure chance 
basis, the multiple choice item is the easier of the two formats to 
answer, with items keyed true easier than those keyed false with
regard to the true-false format. (Author/LS)
Considerable discussion has taken place among measurement specialists
regarding the virtues of multiple choice versus true-false test item
formats. Recent contrasting examples might include "...the advantages
attributed to (true-false items) are not, unfortunately, very valid...."
(Gronlund, 1971, p. 160), and "...a few (test specialists) see special
virtues of efficiency and ease of preparation in (true-false items) and
advocate their wide use" (Ebel, 1971, p. 1). The most obvious limitation
of true-false relative to multiple choice test items is the degree to which
the former is subject to guessing. Several studies have shown that the
reliability of a test is directly related to the number of choices
per item (Remmers, Karslake, and Gage, 1940; Lord, 1944; Carroll, 1945;
Plumlee, 1952). Similarly, it would be expected that a multiple choice
test would have greater reliability than a true-false test if the number
of items were held constant. However, since a greater number of
true-false items can be administered per unit time, it is possible that
in a given amount of time the increased number of true-false items
administered would allow for greater reliability and more efficient
sampling of content objectives than if a multiple choice format had been
used.



Using 88 multiple choice items from a published test in natural
science, Ebel (1971) compared formats by rewriting each multiple
choice item as a parallel true-false item. Two forms, each consisting
of 44 multiple choice and 44 true-false items, were developed. Reliabilities
(K.R. 20) were computed for the multiple choice and true-false sections
of both forms and, assuming that two true-false items could be answered per
multiple choice item, the Spearman-Brown formula was used to predict the
reliability of an 88 item true-false test. For the first form, this
adjusted reliability was greater than the reliability obtained for the
multiple choice section of the test; however, the inverse was true with
respect to the second form.

The present study was concerned with several currently unanswered
questions. First, what is an empirically determined ratio of multiple
choice to equivalent true-false items which can be answered in a given
amount of time? Second, for achievement test items administered within
a classroom situation, which of the two formats under consideration
results in greater reliability per unit of testing time? Third, what
is the relative reliability of true true-false and false true-false
items when compared to multiple choice items? Fourth, what ratio of
multiple choice to equivalent true-false items is necessary for producing
equal reliability coefficients? Lastly, after equating for differences
in the effect of guessing, what is the relative difficulty of the
different formats?



Paper presented at the Annual Meeting of the American Educational
Research Association, Chicago, Illinois, April 3-7, 1972.




Method 



One hundred one undergraduates enrolled in one section of an
introductory measurements course served as subjects (Ss). Forty
multiple choice items were selected from an item pool on the basis of
their relationship to stated course objectives, and according to their
ability to discriminate between levels of achievement. Only items
which consisted of one correct option and three independent and
incorrect options were used. Each multiple choice (MC) item was rewritten
as a true-false item keyed true (Tf) by combining the stem and correct
option, and also as a true-false item keyed false (tF) by combining
the stem and the best discriminating incorrect option. An example of
an MC item and corresponding Tf and tF items is provided in Illustration 1.
The total of 120 items was used as the final course examination for all
Ss. Part 1 of this exam consisted of the 40 MC items, whereas Part 2
contained the 80 Tf and tF items. The members of each pair of true-false
questions generated from the same MC item were randomly assigned to the
first or second set of 40 items in Part 2. The position of each true-false
item was then randomly assigned within each of these two sets.
One-half of the Ss began with Part 1 of the exam while the remaining Ss
began with Part 2, both groups completing all 120 items. At the end of 40,
80, and 120 items, the Ss were requested to record the number of minutes
required to reach these respective points in the exam, the elapsed time
being indicated on the front board.

Separate reliabilities were computed from the 40 MC, Tf, tF, and
mixed true-false (Mtf) items. The reliability of all 80 Mtf items
(Tf + tF items) was obtained and, using the Spearman-Brown formula,
the reliability of a 40 item Mtf test was calculated in order to keep
test lengths equal for comparative purposes. Average elapsed times
were computed for MC and true-false items (times for Tf and tF items
could not be computed separately since these items were intermixed,
and for purposes of this study their times were assumed to be equal).
Using the Spearman-Brown formula, the reliabilities of the Tf, tF, and Mtf
items were adjusted for differences in time required to answer MC items.
These reliabilities were also adjusted using the 2:1 ratio incorporated
by Ebel (1971). Again using the Spearman-Brown formula, the
ratio of Tf, tF, and Mtf to MC items required for equivalent reliabilities
was computed. After modifying the respective Ss' scores with the correction
for guessing formula, a repeated measures ANOVA design was used to
compare the difficulties of MC, Tf, and tF items.
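
For reference, the two adjustments named in this section have standard textbook forms, sketched below. The notation (r, k, r*, R, W, n) is chosen here for illustration and is not taken from the paper, which does not print the formulas.

```latex
% Spearman-Brown: projected reliability r_k of a test whose length is
% multiplied by a factor k, given the reliability r of the original test.
r_k = \frac{k\,r}{1 + (k - 1)\,r}

% Solving the same expression for the length factor k needed to reach
% a target reliability r^{*}:
k = \frac{r^{*}\,(1 - r)}{r\,(1 - r^{*})}

% Correction for guessing: R items right, W items wrong,
% n response options per item.
S = R - \frac{W}{n - 1}
```

With n = 2 (true-false) the correction reduces to S = R - W; with n = 4 (four-option multiple choice) it is S = R - W/3.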



Illustration 1. Sample Items

A major advantage of individual intelligence tests over group tests
is that

   A. the standardization group is usually larger
  *B. information other than the test score can be obtained
   C. the method of scoring is more objective
   D. they must be administered by skilled examiners

*T  F   Individual intelligence tests are superior to group intelligence
        tests in the sense that individual tests provide more information.

 T *F   Relative to scoring procedures, individual intelligence tests are
        superior to group intelligence tests in that individual tests are
        more objectively scored.



Results 
Table 1 provides descriptive statistics related to sections of the
exam composed of MC, Tf, and tF items. These indices are also given for
the combined true-false (Mtf) items, and for the test as a whole.
Discriminations are point-biserial correlations between Ss' scores
on individual items and total test scores. Reliability coefficients
were determined using the Kuder-Richardson formula No. 20. With
the exception of the reliability coefficients, the information contained
in this table is provided for background only.
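
The two indices described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' procedure: the function names and the small `demo` response matrix are hypothetical, the population variance of total scores is assumed for KR-20, and the point-biserial is computed against a total score that includes the item itself.

```python
from math import sqrt
from statistics import mean, pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores.

    responses: one list per examinee, one 0/1 entry per item.
    Assumption: population variance is used for the total scores.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # Sum of p*q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = mean(row[j] for row in responses)
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / pvariance(totals))


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item and the total scores."""
    mx, my = mean(item), mean(totals)
    cov = mean((x - mx) * (y - my) for x, y in zip(item, totals))
    return cov / sqrt(pvariance(item) * pvariance(totals))


# Hypothetical data: five examinees, four items (not from the study).
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
demo_totals = [sum(row) for row in demo]
first_item = [row[0] for row in demo]

print(kr20(demo))
print(point_biserial(first_item, demo_totals))
```

Applied to the study's data, `kr20` would be called once per section (MC, Tf, tF, Mtf, all items) on the corresponding columns of the response matrix.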

The average amount of time required to answer MC items was 1.18
minutes, while the average time was .68 minutes for true-false items.
This resulted in a ratio of 1:1.73 multiple choice items to true-false
items answered per unit time. Table 2 provides the
reliabilities before adjustment associated with items of each format,
and the corresponding reliabilities after adjustment using the
Spearman-Brown formula. The reliability associated with Mtf items was
adjusted from 80 to 40 items for comparative purposes. Because multiple
choice and true-false items required different amounts of time to answer,
the reliability of each true-false format was then adjusted to represent
tests 1.73 times the length of 40 items, and similarly tests twice as
long as 40 items. Table 2 also indicates the number of test items of each
item format which would have been required per multiple choice item in
order to establish equivalent reliabilities.
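
The adjustments described above can be retraced from the unadjusted coefficients with the standard Spearman-Brown formula. The sketch below is an illustration under that assumption; the function and variable names are chosen here and do not come from the paper.

```python
def spearman_brown(r, k):
    """Projected reliability when test length is multiplied by k."""
    return k * r / (1 + (k - 1) * r)


def length_factor(r, target):
    """Length multiplier needed to move reliability r up to target."""
    return target * (1 - r) / (r * (1 - target))


# Unadjusted KR-20 coefficients as reported (40 MC, 40 Tf, 40 tF, 80 Mtf).
MC, TF_TRUE, TF_FALSE, MTF_80 = 0.816, 0.503, 0.648, 0.702

# Mixed true-false: shorten from 80 to 40 items (k = 0.5).
MTF_40 = spearman_brown(MTF_80, 0.5)

# Reported time ratio; 1.18 / 0.68 from the rounded means is about 1.735.
TIME_RATIO = 1.73

for name, r in [("Tf", TF_TRUE), ("tF", TF_FALSE), ("Mtf", MTF_40)]:
    print(
        name,
        round(spearman_brown(r, TIME_RATIO), 3),  # time ratio 1:1.73
        round(spearman_brown(r, 2.0), 3),         # Ebel's 1:2 ratio
        round(length_factor(r, MC), 2),           # items per MC item
    )
```

Under these assumptions the adjusted reliabilities agree with the tabled values to three decimals. The length factor for Mtf items comes out slightly above the tabled 3.75, presumably because the published coefficients are rounded.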

The average adjusted scores obtained with the MC, Tf, and tF
items were 19.91, 14.87, and 12.34 respectively. The hypothesis of
equal means was rejected (F = 45.99; df = 2, 200; p < .01). Post hoc
procedures utilizing the Scheffé technique demonstrated that each mean was
significantly different from the other two (p < .01).
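
As a rough check, the adjusted means can be approximated from the mean number correct on each 40-item section using the standard correction for guessing. The sketch assumes that no items were omitted, so that the number wrong is 40 minus the number right; the paper does not state this, and the names below are illustrative.

```python
def corrected_score(right, wrong, n_options):
    """Standard correction for guessing: right minus wrong / (options - 1)."""
    return right - wrong / (n_options - 1)


N_ITEMS = 40  # items per section

# Mean number correct per section as reported.
# Assumption: no omitted items, so wrong = N_ITEMS - right.
mc_adj = corrected_score(24.93, N_ITEMS - 24.93, 4)  # four-option MC
tf_adj = corrected_score(27.44, N_ITEMS - 27.44, 2)  # true-false, keyed true
tF_adj = corrected_score(26.17, N_ITEMS - 26.17, 2)  # true-false, keyed false

print(round(mc_adj, 2), round(tf_adj, 2), round(tF_adj, 2))
```

Because the correction is linear in the number right, applying it to the section means is equivalent to averaging individually corrected scores when nothing is omitted. The MC and tF values agree with the reported means; the Tf value differs from the reported 14.87 by about .01, which may reflect rounding or a small number of omitted responses.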



Table 1

Data on Various Item Formats

Item Format              MC       Tf       tF       Mtf      All Items
----------------------------------------------------------------------
Number of Items          40       40       40       80       120
Mean No. Correct         24.93    27.44    26.17    53.60    78.53
Standard Deviation       6.18     3.80     4.66     7.07     12.38
Median Difficulty        .640     .695     .640     .645     .645
Median Discrimination    .350     .195     .245     .230     .265
Reliability              .816     .503     .648     .702     .856
Table 2

Comparison of Reliability Coefficients

                                                 Item Format
                                          MC       Tf       tF       Mtf
-------------------------------------------------------------------------
Reliabilities
  Unadjusted                              .816     .503     .648     .702
  Adjusted to 40 items                    .816     .503     .648     .541
  Adjusted for time ratio of 1:1.73       .816     .636     .761     .671
  Adjusted for time ratio of 1:2          .816     .669     .786     .702

Number of items per MC item required
  for equivalent reliability              1.00     4.38     2.41     3.75



Discussion 

Data from this research have indicated that true-false items, particularly
those items which are in fact true, result in a less reliable test than if
a four-option multiple choice format had been used. This relationship held
true even when differences in time needed to answer the respective formats
were taken into account. The data suggested that approximately two and
one-half to four and one-half times as many true-false as multiple choice
items were necessary in order to produce equivalent reliabilities, this
ratio being greater than the rate at which true-false items would be
answered relative to multiple choice items. This would have been the
situation even had the ratio of true-false to multiple choice items
answered per unit time been 2:1. This supports the conclusion that if the
true-false format were used in lieu of multiple choice items for
achievement tests administered within a classroom situation, the increase
in content sampling would be accomplished at the sacrifice of reliability.

However, one might infer that, since several of the items written in the
true-false format and used in the present study obtained discriminations
(point-biserial correlations) within the .45 to .55 range, it would be
possible, with time, to develop a test consisting entirely of highly
discriminating true-false items whose resulting reliability would
consistently rival that of a parallel test using the multiple choice
format. But it does appear that such a possibility lies closer to the
domain of standardized tests, where extensive item revision is more common,
than to the development of teacher-oriented instruments.

It also appears that when the correction for guessing formula is
applied in order to equalize scores relative to items correctly answered on
a pure chance basis, the multiple choice item is the easier of the two
formats to answer, with items keyed true easier than those keyed false with
regard to the true-false format. Implications of these results when using
multiple choice as opposed to true-false items, or vice versa, for formative
or summative evaluation in a mastery learning model are evident. Depending
on the type of item format used, the number of objectives indicated as
mastered would differ.